Goto

Collaborating Authors

 survey response



Questioning the Survey Responses of Large Language Models

Neural Information Processing Systems

Surveys have recently gained popularity as a tool to study large language models. By comparing models' survey responses to those of different human reference populations, researchers aim to infer the demographics, political opinions, or values best represented by current language models. In this work, we critically examine language models' survey responses on the basis of the well-established American Community Survey by the U.S. Census Bureau. Evaluating 43 different language models using de-facto standard prompting methodologies, we establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, towards survey responses labeled with the letter "A".


QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models

Kreutner, Maximilian, Rupprecht, Jens, Ahnert, Georg, Salem, Ahmed, Strohmaier, Markus

arXiv.org Artificial Intelligence

We introduce QSTN, an open-source Python framework for systematically generating responses from questionnaire-style prompts to support in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation ($>40 $ million survey responses) shows that question structure and response generation methods have a significant impact on the alignment of generated survey responses with human answers, and can be obtained for a fraction of the compute cost. In addition, we offer a no-code user interface that allows researchers to set up robust experiments with LLMs without coding knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.


Prompt Perturbations Reveal Human-Like Biases in Large Language Model Survey Responses

Rupprecht, Jens, Ahnert, Georg, Strohmaier, Markus

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known human-like response biases, such as central tendency, opinion floating and primacy bias are poorly understood. This work investigates the response robustness of LLMs in normative survey contexts, we test nine LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of ten perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated survey interviews. In doing so, we not only reveal LLMs' vulnerabilities to perturbations but also show that all tested models exhibit a consistent recency bias, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.


Simulating Persuasive Dialogues on Meat Reduction with Generative Agents

Ahnert, Georg, Wurth, Elena, Strohmaier, Markus, Mata, Jutta

arXiv.org Artificial Intelligence

Meat reduction benefits human and planetary health, but social norms keep meat central in shared meals. To date, the development of communication strategies that promote meat reduction while minimizing social costs has required the costly involvement of human participants at each stage of the process. We present work in progress on simulating multi-round dialogues on meat reduction between Generative Agents based on large language models (LLMs). We measure our main outcome using established psychological questionnaires based on the Theory of Planned Behavior and additionally investigate Social Costs. We find evidence that our preliminary simulations produce outcomes that are (i) consistent with theoretical expectations; and (ii) valid when compared to data from previous studies with human participants. Generative agent-based models are a promising tool for identifying novel communication strategies on meat reduction -- tailored to highly specific participant groups -- to then be tested in subsequent studies with human participants.



Modeling User Behavior from Adaptive Surveys with Supplemental Context

Shukla, Aman, Scantlebury, Daniel Patrick, Kumar, Rishabh

arXiv.org Artificial Intelligence

Modeling user behavior is critical across many industries where understanding preferences, intent, or decisions informs personalization, targeting, and strategic outcomes. Surveys have long served as a classical mechanism for collecting such behavioral data due to their interpretability, structure, and ease of deployment. However, surveys alone are inherently limited by user fatigue, incomplete responses, and practical constraints on their length making them insufficient for capturing user behavior. In this work, we present LANTERN (Late-Attentive Network for Enriched Response Modeling), a modular architecture for modeling user behavior by fusing adaptive survey responses with supplemental contextual signals. We demonstrate the architectural value of maintaining survey primacy through selective gating, residual connections and late fusion via cross-attention, treating survey data as the primary signal while incorporating external modalities only when relevant. LANTERN outperforms strong survey-only baselines in multi-label prediction of survey responses. We further investigate threshold sensitivity and the benefits of selective modality reliance through ablation and rare/frequent attribute analysis. LANTERN's modularity supports scalable integration of new encoders and evolving datasets. This work provides a practical and extensible blueprint for behavior modeling in survey-centric applications.


Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data

Zhang, Jihong, Liang, Xinya, Deng, Anqi, Bonge, Nicole, Tan, Lin, Zhang, Ling, Zarrett, Nicole

arXiv.org Artificial Intelligence

Mixed methods research integrates quantitative and qualitative data but faces challenges in aligning their distinct structures, particularly in examining measurement characteristics and individual response patterns. Advances in large language models (LLMs) offer promising solutions by generating synthetic survey responses informed by qualitative data. This study investigates whether LLMs, guided by personal interviews, can reliably predict human survey responses, using the Behavioral Regulations in Exercise Questionnaire (BREQ) and interviews from after-school program staff as a case study. Results indicate that LLMs capture overall response patterns but exhibit lower variability than humans. Incorporating interview data improves response diversity for some models (e.g., Claude, GPT), while well-crafted prompts and low-temperature settings enhance alignment between LLM and human responses. Demographic information had less impact than interview content on alignment accuracy. These findings underscore the potential of interview-informed LLMs to bridge qualitative and quantitative methodologies while revealing limitations in response variability, emotional interpretation, and psychometric fidelity. Future research should refine prompt design, explore bias mitigation, and optimize model settings to enhance the validity of LLM-generated survey data in social science research.


Questioning the Survey Responses of Large Language Models

Neural Information Processing Systems

Surveys have recently gained popularity as a tool to study large language models. By comparing models' survey responses to those of different human reference populations, researchers aim to infer the demographics, political opinions, or values best represented by current language models. In this work, we critically examine language models' survey responses on the basis of the well-established American Community Survey by the U.S. Census Bureau. Evaluating 43 different language models using de-facto standard prompting methodologies, we establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, towards survey responses labeled with the letter "A".


Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemorcacy

Karanjai, Rabimba, Shor, Boris, Austin, Amanda, Kennedy, Ryan, Lu, Yang, Xu, Lei, Shi, Weidong

arXiv.org Artificial Intelligence

This paper investigates the use of Large Language Models (LLMs) to synthesize public opinion data, addressing challenges in traditional survey methods like declining response rates and non-response bias. We introduce a novel technique: role creation based on knowledge injection, a form of in-context learning that leverages RAG and specified personality profiles from the HEXACO model and demographic information, and uses that for dynamically generated prompts. This method allows LLMs to simulate diverse opinions more accurately than existing prompt engineering approaches. We compare our results with pre-trained models with standard few-shot prompts. Experiments using questions from the Cooperative Election Study (CES) demonstrate that our role-creation approach significantly improves the alignment of LLM-generated opinions with real-world human survey responses, increasing answer adherence. In addition, we discuss challenges, limitations and future research directions.